feat: add support for H200 accelerator type and AKS overlays [WIP] #485

Open
Jont828 wants to merge 2 commits into NVIDIA:main from Jont828:h200-aks

Jont828 (Contributor) commented Apr 2, 2026

Register h200 as a supported accelerator type in the criteria system and add AKS-specific recipe overlays mirroring the existing H100 AKS set. The Azure H200 VM size (Standard_ND96isr_H200_v5) uses NVIDIA H200 Tensor Core GPUs, each with 141 GB of HBM3e memory and 900 GB/s NVLink interconnect.

  • Add CriteriaAcceleratorH200 constant, parser case, and model matcher
  • Create 6 AKS overlay files (training, inference, Ubuntu variants, Kubeflow, Dynamo) inheriting from the aks-training/aks-inference bases
  • Add H200 UAT chainsaw tests for training and inference CUJs
  • Update unit tests for accelerator parsing and GPU model matching
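
For reviewers who have not opened the diff, the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: only the name CriteriaAcceleratorH200 comes from the description above. The CriteriaAccelerator type, the string values, the H100 constant, and the ParseAccelerator and MatchGPUModel helpers are hypothetical names chosen for the example, and the substring matching is an assumed strategy.

```go
package main

import (
	"fmt"
	"strings"
)

// CriteriaAccelerator is a hypothetical string type for accelerator criteria.
type CriteriaAccelerator string

const (
	// Assumed to exist already; the value is illustrative.
	CriteriaAcceleratorH100 CriteriaAccelerator = "h100"
	// Constant named in the PR description; the value is an assumption.
	CriteriaAcceleratorH200 CriteriaAccelerator = "h200"
)

// ParseAccelerator (hypothetical) converts user input into a known
// accelerator type, ignoring case and surrounding whitespace.
func ParseAccelerator(s string) (CriteriaAccelerator, error) {
	switch strings.ToLower(strings.TrimSpace(s)) {
	case "h100":
		return CriteriaAcceleratorH100, nil
	case "h200": // the new parser case
		return CriteriaAcceleratorH200, nil
	default:
		return "", fmt.Errorf("unsupported accelerator type: %q", s)
	}
}

// MatchGPUModel (hypothetical) maps a reported GPU model string to an
// accelerator type by case-insensitive substring match.
func MatchGPUModel(model string) (CriteriaAccelerator, bool) {
	m := strings.ToUpper(model)
	switch {
	case strings.Contains(m, "H200"): // the new model matcher
		return CriteriaAcceleratorH200, true
	case strings.Contains(m, "H100"):
		return CriteriaAcceleratorH100, true
	default:
		return "", false
	}
}

func main() {
	// Sample inputs only; real model strings depend on the node's driver.
	acc, err := ParseAccelerator("H200")
	fmt.Println(acc, err)

	matched, ok := MatchGPUModel("NVIDIA H200")
	fmt.Println(matched, ok)
}
```

In this sketch, unknown inputs are rejected with an error rather than mapped to a default, so a typo in the accelerator criterion fails loudly instead of silently selecting the wrong overlay.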

Summary

Motivation / Context

Fixes:
Related:

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

Testing

# Commands run (prefer `make qualify` for non-trivial changes)
make qualify

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes:

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

Register h200 as a supported accelerator type in the criteria system
and add AKS-specific recipe overlays mirroring the existing H100 AKS
set. The Azure H200 VM size (Standard_ND96isr_H200_v5) uses NVIDIA H200
Tensor Core GPUs, each with 141 GB of HBM3e memory and 900 GB/s NVLink
interconnect.

- Add CriteriaAcceleratorH200 constant, parser case, and model matcher
- Create 6 AKS overlay files (training, inference, ubuntu variants,
  kubeflow, dynamo) inheriting from aks-training/aks-inference bases
- Add H200 UAT chainsaw tests for training and inference CUJs
- Update unit tests for accelerator parsing and GPU model matching

Signed-off-by: Jont828 <jt572@cornell.edu>
Jont828 requested review from a team as code owners April 2, 2026 22:15

copy-pr-bot bot commented Apr 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Jont828 changed the title from "feat: add support for H200 accelerator type and AKS overlays" to "feat: add support for H200 accelerator type and AKS overlays [WIP]" Apr 2, 2026
3 participants